59 research outputs found

    DISTRIBUTED MULTIDIMENSIONAL INDEXING FOR SCIENTIFIC DATA ANALYSIS APPLICATIONS

    Scientific data analysis applications require large-scale computing power to effectively service client queries and also require large storage repositories for datasets that are generated continually from sensors and simulations. These scientific datasets are growing in size every day, and are becoming truly enormous. The goal of this dissertation is to provide efficient multidimensional indexing techniques that aid in navigating distributed scientific datasets. In this dissertation, we show significant improvements in accessing large distributed scientific datasets. The first approach we took to improve access to subsets of large multidimensional scientific datasets was data chunking. The contents of scientific data files typically are a collection of multidimensional arrays, along with the corresponding metadata. Data chunking groups data elements into small chunks of a fixed, but data-specific, size to take advantage of spatio-temporal locality, since it is not efficient to index individual data elements of large scientific datasets. The second approach was the design of an efficient multidimensional index for scientific datasets. This work investigates how existing multidimensional indexing structures perform on chunked scientific datasets, and compares their performance with that of our own indexing structure, SH-trees. Since R-trees were proposed, various multidimensional indexing structures have been proposed. However, there are relatively few studies focused on improving the performance of indexing geographically distributed datasets, especially across heterogeneous machines. As a third approach, in an attempt to accelerate indexing performance for distributed datasets, we proposed several distributed multidimensional indexing schemes: replicated centralized indexing, hierarchical two-level indexing, and decentralized two-level indexing.
    Our experimental results show that great performance improvements are gained from distributing a multidimensional index. However, the design choices for distributed indexing, such as replication, partitioning, and decentralization, must be carefully considered since they may decrease the overall performance in certain situations. Therefore, this work provides performance guidelines to aid in selecting the best distributed multidimensional indexing scheme for various systems and applications. Finally, we describe how a distributed multidimensional indexing scheme can be used by a distributed multiple query optimization middleware as a case-study application to generate better query plans by leveraging information about the contents of remote caches
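
    The data chunking idea in the abstract above can be illustrated with a minimal sketch. It is not code from the dissertation: the function names, the half-open box convention, and the linear scan over chunk boxes (standing in for a real multidimensional index such as an SH-tree) are all assumptions made for illustration.

```python
from itertools import product


def chunk_bounding_boxes(shape, chunk_shape):
    """Tile an array of the given shape with fixed-size chunks.

    Returns a list of (low, high) corner pairs, with `high` exclusive.
    Chunks on the upper edge are clipped to the array bounds.
    """
    origins = [range(0, n, c) for n, c in zip(shape, chunk_shape)]
    boxes = []
    for low in product(*origins):
        high = tuple(min(o + c, n) for o, c, n in zip(low, chunk_shape, shape))
        boxes.append((low, high))
    return boxes


def overlaps(box, query):
    """True if two half-open boxes intersect in every dimension."""
    (lo, hi), (qlo, qhi) = box, query
    return all(l < qh and ql < h for l, h, ql, qh in zip(lo, hi, qlo, qhi))


def chunks_for_query(boxes, query):
    """Hypothetical lookup: a linear scan standing in for an index search."""
    return [box for box in boxes if overlaps(box, query)]


boxes = chunk_bounding_boxes(shape=(4, 6), chunk_shape=(2, 3))
hits = chunks_for_query(boxes, query=((1, 1), (2, 2)))
```

    Only the chunk bounding boxes are indexed here, so the number of indexed objects depends on the chunk size rather than on the number of individual data elements.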

    Multiple Range Query Optimization with Distributed Cache Indexing

    MQO is a distributed multiple query processing middleware that can optimize query processing for data analysis applications on the Grid. It has one or more proxies that act as front ends to a collection of backend servers. The basic idea behind this architecture is semantic caching, whereby queries can leverage available cached results in the proxy either directly or through transformations. While this approach has been shown to speed up query evaluation under multi-client workloads, the caching infrastructure in the backend servers is not well used for query planning. In this paper, we describe a distributed multidimensional indexing scheme that enables the proxy to directly consider the cache contents available at the backend servers for planning and scheduling. This approach is shown to produce better query plans and faster query response times. We experimentally demonstrate that system throughput can be improved by up to 66%, compared to either load-based or round-robin scheduling
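
    A rough sketch of cache-aware scheduling in the spirit of the abstract above follows. It is not the MQO implementation: the `pick_server` function, the overlap-volume scoring rule, and the dictionary-based cache index are hypothetical stand-ins, and a plain scan replaces the distributed multidimensional index.

```python
def overlap_volume(a, b):
    """Volume of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    (alo, ahi), (blo, bhi) = a, b
    volume = 1.0
    for al, ah, bl, bh in zip(alo, ahi, blo, bhi):
        side = min(ah, bh) - max(al, bl)
        if side <= 0:
            return 0.0
        volume *= side
    return volume


def pick_server(query, cache_index, load):
    """Choose a backend server for a range query.

    cache_index: server name -> list of cached result boxes (assumed layout).
    load:        server name -> number of pending queries.
    Prefers the server whose best cached box overlaps the query most,
    and falls back to the least-loaded server when nothing is reusable.
    """
    def score(server):
        reuse = max(
            (overlap_volume(query, box) for box in cache_index.get(server, [])),
            default=0.0,
        )
        return (-reuse, load[server], server)

    return min(load, key=score)


cache_index = {"s1": [((0, 0), (5, 5))], "s2": [((20, 20), (30, 30))]}
load = {"s1": 3, "s2": 0}
chosen = pick_server(((0, 0), (10, 10)), cache_index, load)
fallback = pick_server(((0, 0), (10, 10)), {}, load)
```

    With purely load-based scheduling the query would go to `s2`; consulting the cache contents sends it to `s1`, where part of the result is already cached.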

    Indexing Cached Multidimensional Objects in Large Main Memory Systems

    Semantic caches allow queries into large datasets to leverage cached results either directly or through transformations, using semantic information about the data objects in the cache. As the price of main memory continues to drop and its size increases, the size of semantic caches grows proportionately, and it is becoming expensive to compare the semantic information for each data object in the cache against a query predicate. Instead, we propose to create an index for cached objects. Unlike straightforward linear scanning, indexing cached objects creates additional overhead for cache replacement. Since the contents of a semantic cache may change dynamically at a high rate, the cache index must support fast inserts and deletes as well as fast search. In this paper, we show that multidimensional indexing helps navigate efficiently through a large semantic cache in spite of the additional overhead and overall is considerably less expensive than linear scanning. Little emphasis has been laid upon the performance of multidimensional index inserts and deletes, as opposed to search performance. We compare the performance of a few widely used multidimensional indexing structures with our SH-tree, looking at insert, delete, and search operations, and show that SH-trees overall perform better for large semantic caches than the widely used indexing techniques

    Longitudinal evolution of cortical thickness signature reflecting Lewy body dementia in isolated REM sleep behavior disorder: a prospective cohort study

    Background: The isolated rapid-eye-movement sleep behavior disorder (iRBD) is a prodromal condition of Lewy body disease including Parkinson's disease and dementia with Lewy bodies (DLB). We aim to investigate the longitudinal evolution of DLB-related cortical thickness signature in a prospective iRBD cohort and evaluate the possible predictive value of the cortical signature index in predicting dementia-first phenoconversion in individuals with iRBD. Methods: We enrolled 22 DLB patients, 44 healthy controls, and 50 video polysomnography-proven iRBD patients. Participants underwent 3-T magnetic resonance imaging (MRI) and clinical/neuropsychological evaluations. We characterized DLB-related whole-brain cortical thickness spatial covariance pattern (DLB-pattern) using scaled subprofile model of principal components analysis that best differentiated DLB patients from age-matched controls. We analyzed clinical and neuropsychological correlates of the DLB-pattern expression scores and the mean values of the whole-brain cortical thickness in DLB and iRBD patients. With repeated MRI data during the follow-up in our prospective iRBD cohort, we investigated the longitudinal evolution of the cortical thickness signature toward Lewy body dementia. Finally, we analyzed the potential predictive value of cortical thickness signature as a biomarker of phenoconversion in iRBD cohort. Results: The DLB-pattern was characterized by thinning of the temporal, orbitofrontal, and insular cortices and relative preservation of the precentral and inferior parietal cortices. The DLB-pattern expression scores correlated with attentional and frontal executive dysfunction (Trail Making Test-A and B: R = −0.55, P = 0.024 and R = −0.56, P = 0.036, respectively) as well as visuospatial impairment (Rey-figure copy test: R = −0.54, P = 0.0047). The longitudinal trajectory of DLB-pattern revealed an increasing pattern above the cut-off in the dementia-first phenoconverters (Pearson's correlation, R = 0.74, P = 6.8 × 10⁻⁴) but no significant change in parkinsonism-first phenoconverters (R = 0.0063, P = 0.98). The mean value of the whole-brain cortical thickness predicted phenoconversion in iRBD patients with hazard ratio of 9.33 [1.16–74.12]. The increase in DLB-pattern expression score discriminated dementia-first from parkinsonism-first phenoconversions with 88.2% accuracy. Conclusion: Cortical thickness signature can effectively reflect the longitudinal evolution of Lewy body dementia in the iRBD population. Replication studies would further validate the utility of this imaging marker in iRBD

    Multi-dimensional Range Query Processing on the GPU


    A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets

    Scientific applications that query into very large multidimensional datasets are becoming more common. These datasets are growing in size every day, and are becoming truly enormous, making it infeasible to index individual data elements. We have instead been experimenting with chunking the datasets to index them, grouping data elements into small chunks of a fixed, but dataset-specific, size to take advantage of spatial locality. While spatial indexing structures based on R-trees perform reasonably well for the rectangular bounding boxes of such chunked datasets, other indexing structures based on KDB-trees, such as Hybrid trees, have been shown to perform very well for point data. In this paper, we investigate how all these indexing structures perform for multidimensional scientific datasets, and compare their features and performance with that of SH-trees, an extension of Hybrid trees, for indexing multidimensional rectangles. Our experimental results show that the algorithms for building and searching SH-trees outperform those for R-trees, R*-trees, and X-trees for both real application and synthetic datasets and queries. We show that the SH-tree algorithms perform well for both low and high dimensional data, and that they scale well to high dimensions both for building and searching the trees

    Parallel Tree Traversal for Nearest Neighbor Query on the GPU

    The similarity search problem is found in many application domains including computer graphics, information retrieval, statistics, computational biology, and scientific data processing, just to name a few. Recently several studies have been performed to accelerate k-nearest neighbor (kNN) queries using GPUs, but most of these works develop brute-force exhaustive scanning algorithms leveraging a large number of GPU cores, and none of the prior works employ GPUs for an n-ary tree structured index. It is known that multi-dimensional hierarchical indexing trees such as R-trees are inherently not well suited for GPUs because of their irregular tree traversal and memory access patterns. Traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism since GPUs are tailored for deterministic memory accesses. In this work, we develop a data parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), for kNN query processing on the GPU. This algorithm traverses a multi-dimensional tree structured index while avoiding warp divergence problems. In order to take advantage of accessing contiguous memory blocks, the proposed PSB algorithm performs linear scanning of sibling leaf nodes, which increases the chance to optimize the parallel SIMD algorithm. We evaluate the performance of the PSB algorithm against the classic branch-and-bound kNN query processing algorithm. Our experiments with real datasets show that the PSB algorithm is faster by a large margin than the branch-and-bound algorithm

    Co-processing heterogeneous parallel index for multi-dimensional datasets

    We present a novel multi-dimensional range query co-processing scheme for the CPU and GPU. It has been reported that traversing hierarchical tree structures in parallel is inherently not efficient because of large branching factors. Besides, it is known that the recursive tree traversal algorithm required for multi-dimensional range queries is not well suited for the GPU architecture owing to its small shared memory. In this paper, we propose co-processing range queries using both the CPU and GPU to make the most use of each architecture. In the Hybrid tree that we present in this paper, we let the CPU navigate the internal nodes of hierarchical tree structures and make the GPU scan leaf nodes in a linear fashion using a massively large number of processing units. With the co-processing scheme, we can asynchronously leverage the strengths of each architecture. We also propose a novel dynamic GPU block scheduling algorithm for multiple range queries. In our scheduling algorithm, we consider the selection ratio of each query to determine the number of GPU blocks to launch. By assigning the right number of GPU blocks, we can significantly improve the query processing throughput for multiple concurrent queries. Our extensive experimental study shows that the proposed co-processing scheme achieves up to 12× faster query response time than the state-of-the-art GPU tree traversal algorithm. We also show that our dynamic GPU block assignment algorithm improves the query processing throughput by up to 4×
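
    The abstract above says only that the selection ratio of each query determines how many GPU blocks to launch. One plausible reading, sketched here purely as an assumption and not as the paper's algorithm, is a proportional split with at least one block per query. The function name `assign_blocks` and the largest-remainder rounding are hypothetical choices.

```python
from fractions import Fraction
from math import floor


def assign_blocks(selection_ratios, total_blocks):
    """Split `total_blocks` GPU blocks among concurrent range queries.

    Each query gets one block, and the rest are shared in proportion to
    its (non-negative) selection ratio. Leftover blocks from rounding go
    to the queries with the largest fractional remainders.
    """
    n = len(selection_ratios)
    if n == 0:
        return []
    if total_blocks < n:
        raise ValueError("need at least one block per query")

    # Exact rational arithmetic avoids rounding surprises in the split.
    ratios = [Fraction(r) for r in selection_ratios]
    total = sum(ratios)
    if total == 0:  # no selectivity information: share evenly
        ratios, total = [Fraction(1)] * n, Fraction(n)

    spare = total_blocks - n
    shares = [spare * r / total for r in ratios]
    blocks = [1 + floor(s) for s in shares]

    leftover = total_blocks - sum(blocks)
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - floor(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        blocks[i] += 1
    return blocks


plan = assign_blocks([0.25, 0.25, 0.5], total_blocks=8)
```

    Under this reading, a query expected to select half of the data receives half of the spare blocks, so highly selective queries do not hold blocks they cannot keep busy.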

    Analyzing design choices for distributed multidimensional indexing

    Scientific datasets are often stored on distributed archival storage systems, because geographically distributed sensor devices store the datasets on their local machines and also because the size of scientific datasets demands a large amount of disk space. Multidimensional indexing techniques have been shown to greatly improve range query performance into large scientific datasets. In this paper, we discuss several ways of distributing a multidimensional index in order to speed up access to large distributed scientific datasets. This paper compares the designs, challenges, and problems for distributed multidimensional indexing schemes, and provides a comprehensive performance study of distributed indexing to provide guidelines to choose a distributed multidimensional index for a specific data analysis application.

    Spatial Indexing of Distributed Multidimensional Datasets

    While declustering methods for distributed multidimensional indexing of large datasets have been researched widely in the past, replication techniques for multidimensional indexes have not been investigated deeply. In general, a centralized index server may become the performance bottleneck in a wide area network rather than the data servers, since the index is likely to be accessed more often than any of the datasets in the servers. In this paper, we present two different multidimensional indexing algorithms for a distributed environment - a centralized global index and a two-level hierarchical index. Our experimental results show that the centralized scheme does not scale well for either insertion or searching the index. In order to improve the scalability of the index server, we have employed a replication protocol for both the centralized and two-level index schemes that allows some inconsistency between replicas without affecting correctness. Our experiments show that the two-level hierarchical index scheme shows better scalability for both building and searching the index than the non-replicated centralized index, but replication can make the centralized index faster than the two-level hierarchical index for searching in some cases
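
    To make the two-level hierarchical index in the abstract above concrete, here is a minimal sketch under stated assumptions: the class name `TwoLevelIndex`, the use of one bounding box per data server at the global level, and the linear scans at both levels are illustrative choices, not the paper's algorithms. Replication is not modelled.

```python
def overlaps(a, b):
    """True if two half-open axis-aligned boxes intersect."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al < bh and bl < ah for al, ah, bl, bh in zip(alo, ahi, blo, bhi))


def union(a, b):
    """Smallest box enclosing both input boxes."""
    (alo, ahi), (blo, bhi) = a, b
    low = tuple(min(x, y) for x, y in zip(alo, blo))
    high = tuple(max(x, y) for x, y in zip(ahi, bhi))
    return (low, high)


class TwoLevelIndex:
    """Global level: one enclosing box per server. Local level: chunk boxes."""

    def __init__(self):
        self.global_boxes = {}  # server -> box enclosing all of its chunks
        self.local = {}         # server -> list of (box, chunk_id)

    def insert(self, server, box, chunk_id):
        self.local.setdefault(server, []).append((box, chunk_id))
        if server in self.global_boxes:
            self.global_boxes[server] = union(self.global_boxes[server], box)
        else:
            self.global_boxes[server] = box

    def search(self, query):
        hits = []
        for server, enclosing in self.global_boxes.items():
            if not overlaps(enclosing, query):
                continue  # this server's local index is never contacted
            hits.extend(
                (server, chunk_id)
                for box, chunk_id in self.local[server]
                if overlaps(box, query)
            )
        return hits


index = TwoLevelIndex()
index.insert("a", ((0, 0), (2, 2)), "a0")
index.insert("a", ((2, 0), (4, 2)), "a1")
index.insert("b", ((10, 10), (12, 12)), "b0")
found = index.search(((1, 1), (3, 3)))
```

    The global level stays small because it holds one entry per server, which is the property that lets the hierarchical scheme avoid routing every insertion and search through a single large index.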